Abstract Photometric redshifts (photo-z) are crucial
to the scientific exploitation of modern panchromatic
digital surveys. In this paper we present PhotoRApToR
(Photometric Research Application To Redshift), a
Java/C++ based desktop application capable of solving
non-linear regression and multi-variate classification
problems, and specialized in particular for photo-z
estimation. It embeds a machine learning algorithm, namely
a multi-layer neural network trained by the Quasi Newton
learning rule, together with special tools dedicated to
pre- and post-processing data. PhotoRApToR has been
successfully tested on several scientific cases. The
application is available for free download from the DAME
Program web site.
1 Introduction

The ever growing amount of astronomical data provided by
the new large scale digital surveys in a wide range of
wavelengths of the electromagnetic spectrum has been
challenging the way astronomers carry out their everyday
analysis of astronomical sources, and we can safely assert
that the human ability to directly visualize and correlate
astronomical data has been pushed to its limits in the past
few years. Because data have become too complex to be
effectively managed and analysed with traditional tools, a
methodological shift is emerging and Data Mining (DM)
techniques are becoming more and more popular in tackling
knowledge discovery problems. A typical problem addressed
with these new techniques is the evaluation of photometric
redshifts. The demand for accurate photometric redshifts
(photo-z) has increased over the years, due both to the
advent of a new generation of multi-band surveys (see for
example Connolly et al. 1995, [17]) and to the availability
of large public datasets, which have made it possible to
pursue a wide variety of scientific cases. Ongoing and
future large-field public imaging projects, such as
Pan-STARRS (Farrow et al. 2014), KiDS (Kilo-Degree Survey),
DES (Dark Energy Survey, [19]), the planned surveys with
LSST (Large Synoptic Survey Telescope, Ivezic et al. 2009,
[31]) and Euclid (Red Book, [23]), rely on accurate photo-z
to achieve their scientific goals.

Photo-z are in fact essential in constraining dark matter
and dark energy through weak gravitational lensing
(Serjeant 2014), for the identification of galaxy clusters
and groups (e.g. Capozzi et al. 2009, [13]), for Type Ia
Supernovae, and for the study of the mass function of
galaxy clusters (Albrecht et al. 2006, [1]; Peacock et al.
2006, [37]; Umetsu et al. 2012), just to quote a few
applications. Photometric filters integrate fluxes over a
fairly large wavelength interval and, therefore, the
accuracy of photometric redshift reconstruction is worse
than that of spectroscopic redshifts. On the other hand,
when the telescope time needed to determine spectroscopic
redshifts for all sources in a sample is not available,
photometric redshift methods provide a much more convenient
way to estimate the distance of such sources. The physical
mechanism responsible for the correlation between the
photometric features and the redshift of an astronomical
source is the change in the observed fluxes that occurs
because, due to the stretch introduced by the redshift,
prominent features of the spectrum move across the
different filters of a photometric system.

This mechanism implies a non-linear mapping between the
photometric parameter space of the galaxies and the
redshift values. This non-linear mapping function can be
inferred using advanced statistical and data mining methods
in order to evaluate photometric estimates of the redshift
for a large number of sources. All existing implementations
can be broadly categorized into two classes of methods:
theoretical and empirical. Theoretical methods use template
based Spectral Energy Distributions (SEDs), obtained either
from observed galaxy spectra or from synthetic models.
These methods require extensive a priori knowledge about
the physical properties of the objects, hence they may be
biased by such information. They are, however, the only
viable method when dealing with faint objects beyond the
spectroscopic limit (Hildebrandt et al. 2010, [27] and
references therein).

When accurate and multi-band photometry for a large number
of objects is complemented by spectroscopic redshifts for a
statistically significant sub-sample of the same objects,
empirical methods might offer greater accuracy. This
sample needs, however, to be statistically representative
of the parent population. The spectroscopic redshifts of
this sub-sample are then used to constrain the fit of an
interpolating function mapping the photometric parameter
space. The various methods differ mainly in the way such
interpolation is performed.

From the data mining point of view, the evaluation of
photo-z is a supervised learning problem (Tagliaferri et
al. 2002, [32]; Hildebrandt et al. 2010, [27]), where a set
of examples is used by the method to learn how to
reconstruct the relation between the parameters and the
target (Brescia 2012). In the specific case of photometric
redshifts, the parameters are fluxes, magnitudes or colors
of the sources, while the targets are the spectroscopic
redshifts.

A drawback of this approach is that, as happens for all
interpolative problems, such methods extrapolate poorly and
are therefore effective only when applied to galaxies whose
photometry lies within the range of fluxes/magnitudes and
redshifts well sampled by the training set. In this paper
we present PhotoRApToR (Photometric Research Application To
Redshift), a Java based desktop application capable of
solving regression and classification problems, which has
been finely tuned for photo-z estimation. It embeds a
Machine Learning (ML) algorithm, specifically a particular
instance of a multi-layer neural network, and special tools
dedicated to pre- and post-processing data. The machine
learning model is the MLPQNA (Multi Layer Perceptron
trained by the Quasi Newton Algorithm), which has proven to
be a particularly powerful photo-z estimator, even in the
presence of a relatively small spectroscopic Knowledge Base
(KB) (Cavuoti et al. 2012, [14]; Brescia et al. 2013, [8]).
The application is available for download from the DAME
program web site (see footnote 2). This paper is organized
as follows: in Sect. 2 we describe the Java application; in
Sect. 3 we discuss in some detail how the evaluation of
photometric redshifts is performed. Sect. 4 describes other
functionalities provided by the application, while Sect. 5
is dedicated to a comparison between PhotoRApToR and an
alternative public machine learning tool. Finally, in
Sect. 6 we outline some lessons learned during the
implementation of PhotoRApToR and sketch some future
developments.


2 PhotoRApToR 

Everyone who has used neural methods to evaluate
photometric redshifts knows that many experiments are
needed in order to optimize the results in terms of
features, neural network architecture, and evaluation of
the internal and external errors. When this is coupled with
the needs of modern surveys, which require huge data sets
to be processed, the need for a user friendly, fast and
scalable application clearly emerges. This application
needs to run client-side, since a large part of
astronomical data is stored in private archives that are
not fully accessible online, thus preventing the use of
remote applications, such as those provided by the
DAMEWARE tool (Brescia et al. 2014, [9]). The application
code was developed in Java and runs on top of a standard
Java Virtual Machine, while the machine learning model was
implemented in C++ to increase the core execution speed.
Different installation packages are therefore provided to
support the most common platforms. Moreover, the
application includes a wizard, which guides the user
through the various functionalities offered by the tool.
Fig. 1 shows the main window of the program. The main
features of PhotoRApToR can be summarized as follows:

2 http://dame.dsf.unina.it/dame_photoz.html#photoraptor
Fig. 1 The PhotoRApToR main window.


— Data table manipulation. It allows the user to navigate
through his/her data sets and related metadata, as well as
to prepare data tables to be submitted for experiments. It
includes several options to perform the editing, ordering,
splitting and shuffling of table rows and columns. A
special set of options is dedicated to retrieving and
handling missing data, for instance Not-a-Number (NaN)
entries or parameters not calculated/observed in some data
samples;

— Classification experiments. The user can perform general
classification problems, i.e. automatic separation of an
ensemble of data by assigning a common label to an
arbitrary number of their subsets, each of them grouped on
the basis of a hidden similarity. The classification here
is intended as supervised, in the sense that a subsample of
data must be given for which the right output label has
been previously assigned, based on the a priori knowledge
about the treated problem. The application will learn on
this known sample to classify all new unknown instances of
the problem;

— Regression experiments. The user can perform general
regression problems, i.e. automatic learning to find out an
embedded and unknown analytical law governing an ensemble
of problem data instances (patterns), by correlating the
information carried by each element (features or
attributes) of the given patterns. Regression too is
intended here in a supervised way, i.e. a subsample of
patterns must be given for which the right output is known
a priori. After training on such a KB, the program will be
able to apply the hidden law in the proper way to any new
pattern of the same problem;

— Photo-z estimation. Within the supervised regression
functionality, the application offers a specialized
toolset, specific for photometric redshift estimation.
After the training phase, the system will be able to
predict the photo-z value for any new sky object of the
same type (in terms of photometric input features) as
those in the Knowledge Base;

— Data visualization. The application includes some 2D and
3D graphics tools, for instance multiple histograms and
multiple 2D/3D scatter plots. Such tools are often required
to visually inspect and explore data distributions and
trends;

— Data statistics. For both classification and regression
experiments a statistical report is provided about their
output. In the first case, the typical confusion matrix
(Stehman 1997, [45]) is given, including related
statistical indicators such as classification efficiency,
completeness, purity and contamination for each of the
classes defined by the specific problem. As far as
regression is concerned, the application offers a dedicated
tool, able to provide several statistical relations between
two arbitrary data vectors (usually two columns of a
table), such as average (bias), standard deviation (σ),
Root Mean Square (RMS), Median Absolute Deviation (MAD)
and the Normalized MAD (NMAD, Hoaglin et al. 1983, [28]),
the latter specific for the photo-z quality estimation,
together with percentages of outliers at the common
threshold 0.15 and at different multiples of σ (Brescia et
al. 2014, [10]; Ilbert et al. 2009, [30]).
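
The regression indicators listed above can be illustrated
with a minimal Python sketch. It is not the code shipped
with PhotoRApToR: the function name regression_statistics,
the normalized residual dz = (zspec - zphot) / (1 + zspec)
and the 1.4826 factor used for the NMAD are assumptions made
here for illustration, and the tool itself may adopt
different definitions.

```python
import math
import statistics


def regression_statistics(z_spec, z_phot, threshold=0.15, sigma_multiples=(1, 2, 3)):
    """Summarize the residuals between two data vectors (illustrative sketch)."""
    # Assumption: normalized residual dz = (z_spec - z_phot) / (1 + z_spec).
    dz = [(zs - zp) / (1.0 + zs) for zs, zp in zip(z_spec, z_phot)]
    n = len(dz)

    bias = statistics.fmean(dz)                    # average of the residuals
    sigma = statistics.pstdev(dz)                  # population standard deviation
    rms = math.sqrt(statistics.fmean([d * d for d in dz]))
    median = statistics.median(dz)
    mad = statistics.median([abs(d - median) for d in dz])
    nmad = 1.4826 * mad                            # assumed normalization constant

    report = {"bias": bias, "sigma": sigma, "rms": rms, "mad": mad, "nmad": nmad}
    # Fraction of outliers at the fixed threshold and at multiples of sigma.
    report["outliers_fixed"] = sum(abs(d) > threshold for d in dz) / n
    for m in sigma_multiples:
        report[f"outliers_{m}sigma"] = sum(abs(d) > m * sigma for d in dz) / n
    return report
```

The two input vectors play the role of the two table columns
mentioned above, typically the spectroscopic and the
predicted redshifts of the blind test set.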


Fig. 2 shows the layout of a general PhotoRApToR experiment
workflow, valid for both regression and classification
cases.
Fig. 2 The workflow of a generic experiment performed with PhotoRApToR.


2.1 The Machine Learning model 

The core of the PhotoRApToR application is its ML model,
namely the MLPQNA method. It is a Multi Layer Perceptron
(MLP; Rosenblatt 1961, [32]) neural network (Fig. 3), which
is among the most used feed-forward neural networks in a
large variety of scientific and social contexts. The MLP is
trained by a learning rule based on the Quasi Newton
Algorithm (QNA).

The QNA is a variable metric method for finding local
maxima and minima of functions (Davidon 1991, [20]). The
model based on this learning rule and on the MLP network
topology is then called MLPQNA. QNA is based on Newton's
method to find the stationary (i.e. the zero gradient)
point of a function. In particular, the QNA is an
optimization of Newton's learning rule, because the
implementation is based on a statistical approximation of
the Hessian of the error function, obtained through a
cyclic gradient calculation.

In PhotoRApToR the Quasi Newton method was implemented by
following the known L-BFGS algorithm (Limited memory -
Broyden Fletcher Goldfarb Shanno; Byrd 1994, [12]), which
was originally designed for problems with a very large
number of features (hundreds to thousands). In this case it
is worth accepting a larger number of iterations, caused by
the lower approximation precision, in exchange for a much
lower overhead per iteration. This is particularly useful
in astrophysical data mining problems, where the parameter
space is usually dimensionally huge and is often afflicted
by a low signal-to-noise ratio.
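
To make the limited-memory idea concrete, the following
Python sketch implements a generic L-BFGS loop (two-loop
recursion plus a simple backtracking line search) on a toy
function. It is only an illustration under stated
assumptions: the C++ core of PhotoRApToR is not reproduced
here, the names lbfgs_direction and lbfgs_minimize are
hypothetical, and in MLPQNA the minimized function would be
the network training error rather than a toy objective.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(grad, s_hist, y_hist):
    """Two-loop recursion: search direction from the stored (s, y) pairs."""
    q = list(grad)
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):       # newest to oldest
        alpha = dot(s, q) / dot(y, s)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    if s_hist:                                                 # initial Hessian scaling
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
        q = [gamma * qi for qi in q]
    for s, y, alpha in zip(s_hist, y_hist, reversed(alphas)):  # oldest to newest
        beta = dot(y, q) / dot(y, s)
        q = [qi + (alpha - beta) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]


def lbfgs_minimize(f, grad_f, x0, memory=5, max_iter=200, tol=1e-8):
    """Minimize f starting from x0, keeping at most `memory` curvature pairs."""
    x, g = list(x0), grad_f(x0)
    s_hist, y_hist = [], []
    for _iteration in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            break
        d = lbfgs_direction(g, s_hist, y_hist)
        fx, slope, step = f(x), dot(g, d), 1.0
        for _trial in range(60):                # bounded backtracking (Armijo rule)
            x_new = [xi + step * di for xi, di in zip(x, d)]
            if f(x_new) <= fx + 1e-4 * step * slope:
                break
            step *= 0.5
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:                   # keep only pairs with positive curvature
            s_hist.append(s)
            y_hist.append(y)
            if len(s_hist) > memory:            # limited memory: drop the oldest pair
                s_hist.pop(0)
                y_hist.pop(0)
        x, g = x_new, g_new
    return x
```

The point of the sketch is the storage pattern: only a few
pairs of position and gradient differences are kept instead
of a full Hessian, which is what makes the approach
affordable when the number of network weights is large.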

The analytical description of the method has been given in
the contexts of both classification (Brescia et al. 2012,
[7]) and regression (Brescia et al. 2013, [8] and Cavuoti
et al. 2012, [14]). In the present work, we focus on its
parameter setup and correct use within the presented
framework.

3 Photometric redshift estimation 

In practice, the problem of photo-z evaluation consists in
finding the unknown function which maps the photometric set
of features (magnitudes and/or colors) into the
spectroscopic redshift space. If spectroscopic redshifts
are available for a significant fraction of the objects,
the problem can be approached as a data mining regression
problem, where the a priori knowledge (i.e. the
spectroscopic redshifts forming the KB) is used to uncover
the mapping function. This function can then be used to
derive photo-z for objects without a spectroscopic
counterpart. Without entering into much detail, which can
be found in the literature quoted below and in the
references therein, we just point out that our method has
been successfully used in many experiments done on
different KBs, often composed through accurate
cross-matching among public surveys, such as SDSS for
galaxies (Brescia et al. 2014, [10]), UKIDSS, SDSS, GALEX
and WISE for quasars (many of the following figures refer
to this experiment; Brescia et al. 2013, [8]), GOODS-North
for the PHAT1 contest (Cavuoti et al. 2012, [14]) and
CLASH-VLT data for galaxies (Biviano et al. 2013). Other
photo-z prediction experiments are in progress as
preparatory work for the Euclid Mission (Laureijs et al.
2011, [33]) and the KiDS (see footnote 3) survey projects.

Fig. 3 The typical topology of a generic feed-forward neural network, in this case representing the architecture of MLPQNA.
In this simple example there are two hidden layers (the two blocks of dark gray circles) between the input (X) and output
(Y) layers, corresponding to the architecture mostly used in the case of photo-z estimation. Arrows between layers indicate
the connections (weights w) among neurons. These weights are changed during the training iteration loop, according to the
learning rule QNA.

3.1 User data handling 

The fundamental premise for using PhotoRApToR is that the
user must know in advance how to represent the data and, as
trivial as it might seem, it is worth stating explicitly
that the user must: (i) be aware of the target of the
experiment, for instance a regression or a classification;
and (ii) possess a deep knowledge of the data used. In what
follows we shall call features the input parameters (i.e.,
for instance, fluxes, magnitudes or colors in the case of
photo-z estimation).

Data Formats 


3 http://www.astro-wise.org/projects/KIDS/ 


In order to reach an intelligible and homogeneous
representation of data sets, it is mandatory to take care
of their internal format in advance, to transform the
pattern features, and to force them to assume a uniform
representation before submitting them to the training
process. In this respect real working cases might be quite
different. PhotoRApToR can ingest and/or produce data in
any of the following supported formats:

— FITS [49]: tabular/image; 

— ASCII [2]: ordinary text, i.e. space separated values; 

— VOTable (see footnote 4): VO (Virtual Observatory)
compliant XML-based documents;

— CSV [11]: Comma Separated Values;

— JPEG: Joint Photographic Expert Group, as image output
type.

Missing Data 

Very frequently, data tables have empty entries (sparse
matrix) or missing entries (lack of observed values for
some features in some patterns). Missing values (Marlin
2008, [54]) are frequently (but not always) identified by
special entries in the patterns, like Not-A-Number,
out-of-range values, negative values in a numeric field
normally accepting only positive entries, etc. Missing data
are among the most frequent sources of perturbation in the
learning process, causing confusion in classification
experiments or mismatching in regression problems. This is
especially true for astronomy, where inaccurate or missing
data are not only frequent, but very often cannot simply be
neglected since they carry useful information. To be more
specific, missing data in astronomical databases can be of
two types:

— Type I: true missing data which were not collected. For
instance a given region of the sky or a single object was
not observed in a given photometric band, thus leading to
missing information. These missing data may also arise from
the simple fact that data, coming from any source and
related to a generic experiment, are in most cases not
expressly collected for data mining purposes and, when
originally gathered, some features were not considered
relevant and thus left unchecked;

4 http://www.ivoa.net/documents/VOTable/

— Type II: upper limits or non-detections (i.e. object too
faint to be detected in a given band). In this case the
missing datum conveys very useful information which needs
to be taken into account in the further analysis. It should
be noted, however, that upper limits are often not measured
in the absence of a detection, which makes these missing
data indistinguishable from Type I.

In other words, missing data in a data set might arise from
unknown reasons during the data collection process (Type
I), but sometimes there are very good reasons for their
presence in the data, since they result from a particular
decision or carry specific information about an instance
for a subset of patterns (Type II). This implies that
special care needs to be put into the analysis of the
possible presence (and related causes) of missing values,
together with the decision on how to submit these missing
data to the ML method, in order to take such special cases
into account and prevent wrong behaviors in the learning
process.

Data entries affected by missing attributes, i.e. patterns
having fake values for some features, may be used within
the knowledge base of a photo-z experiment. In particular
they can be used to build data sets with an increasing
quantity of affected patterns, useful to evaluate their
noise contribution to the performance of the photo-z
estimation after training. In principle, a greater amount
of missing data, evenly distributed in both training and
test sets, is expected to induce a greater deterioration in
the quality of the results. This valuable information may
indeed be used to assign different quality indices to the
produced photo-z catalogues. The organization of data sets
with different rates of missing data can be performed
through PhotoRApToR by means of a series of options.

Fig. 4 shows the panel dedicated to defining and
quantifying the presence of missing or bad data within the
user tables. The panel allows the user: (i) to quantify the
number of wrong values to be retained in, or removed from,
the data patterns; (ii) to completely remove the data
patterns affected by the presence of NaN occurrences; (iii)
to assign arbitrary symbols to wrong or missing entries in
the dataset (i.e. symbols like “-999”, “NaN” or whatever).
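
The kind of selection offered by this panel can be sketched
in a few lines of Python. This is a hedged illustration, not
the PhotoRApToR implementation: the function names
is_missing and split_by_missing, the default symbols and the
max_missing parameter are all hypothetical, and table rows
are assumed to be lists of text entries as read from an
ASCII or CSV file.

```python
import math


def is_missing(entry, symbols=("-999", "NaN")):
    """Return True when a table entry matches a user-defined missing-data symbol."""
    text = str(entry).strip()
    if text in symbols:
        return True
    try:
        value = float(text)
    except ValueError:
        return False
    if math.isnan(value):          # catches spellings such as "nan" or "NAN"
        return True
    for symbol in symbols:         # catches equivalent numbers such as "-999.0"
        try:
            if value == float(symbol):
                return True
        except ValueError:
            continue
    return False


def split_by_missing(rows, symbols=("-999", "NaN"), max_missing=0):
    """Split rows into (clean, affected) by the number of missing entries per row."""
    clean, affected = [], []
    for row in rows:
        n_missing = sum(is_missing(entry, symbols) for entry in row)
        (clean if n_missing <= max_missing else affected).append(row)
    return clean, affected
```

Raising max_missing gives data sets with an increasing share
of affected patterns, which is one way to study how missing
data degrade the photo-z estimation.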


Data Editing 

At the core of PhotoRApToR is the MLPQNA neural model.
Therefore, before launching any experiment, it may be
necessary to manipulate the data in order to fulfill the
requirements in terms of the representation and contents of
training and test patterns (data set rows) and features
(data set columns): (i) both the training and test data
files must contain the same number of input and target
columns, and the columns must be in the same order; (ii)
the target columns must always be the last columns of the
data file; (iii) the input columns (features) must be
limited to the physical parameters, without any other type
of additional columns (like column identifiers, object
coordinates etc.); (iv) all input data must be numerical
values (no categorical entries are allowed).
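
As an illustration only, the four requirements can be turned
into a small pre-flight check. The function check_tables,
the target column name "zspec" and the list of forbidden
column names are assumptions introduced here; PhotoRApToR
exposes these operations through its graphical editing
options rather than through a scripting interface.

```python
def check_tables(train_columns, test_columns, rows, target="zspec",
                 forbidden=("id", "ra", "dec")):
    """Return a list of problems found in the table layout (empty if none)."""
    problems = []
    # (i) same columns, in the same order, in training and test tables
    if train_columns != test_columns:
        problems.append("training and test tables differ in columns or column order")
    # (ii) the target must be the last column
    if not train_columns or train_columns[-1] != target:
        problems.append("target column must be the last column")
    # (iii) no identifiers or coordinates among the inputs (hypothetical name list)
    for name in train_columns:
        if name.lower() in forbidden:
            problems.append(f"non-physical column present: {name}")
    # (iv) every entry must be numeric
    for i, row in enumerate(rows):
        for name, entry in zip(train_columns, row):
            try:
                float(entry)
            except (TypeError, ValueError):
                problems.append(f"row {i}, column {name}: non-numeric entry {entry!r}")
    return problems
```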

The application provides a set of specific options to
inspect and modify data file entries. Every time a new data
table is loaded, a new window shows the complete table
properties (Fig. 5), namely: name, metadata, path and the
number of columns and rows.

For a currently loaded table it is possible to select a
subset of the needed columns. After the selection, a table
subset is created and, if the option Row Shuffle is
enabled, the subset rows are also randomly shuffled. The
random shuffling operation is useful to avoid systematic
trends during the training phase and to ensure homogeneity
in the distribution of training and test patterns. This
last property is, in fact, directly connected to the need
to split the initial data into disjoint data sets, to be
used for the training and testing phases, respectively.
This is a simple action made possible by the Split option.
When the table is selected in the Table List, the user must
give two different names for the split files (in this case
train and test) and two different percentages of the
original data set. It is important to observe that, in
supervised machine learning methods, three different
subsets are generally required from the available KB for
every experiment: (i) the training set, to train the method
in order to acquire the hidden correlation among the input
features; (ii) the validation set, used to check and
validate the training, in particular against the loss of
generalization capabilities (a phenomenon also known as
overfitting); and (iii) the test set, used to evaluate the
overall performance of the model (Brescia et al. 2013,
[8]). In the version of the MLPQNA model implemented in the
PhotoRApToR application, the validation is embedded into
the training phase, by means of the standard leave-one-out
k-fold cross validation mechanism (Geisser 1975, [23]).

Fig. 4 Use of the NaN handling tool. After the definition of the NaN symbols, the user can generate a new dataset only with
rows containing NaN elements or another one cleaned by the NaN presence.


Therefore, before any photo-z experiment, the data set
needs to be split into only two subsets, namely the
training and test sets. There is no analytical rule to
decide a priori the percentages of the splitting operation.
According to direct experience, an empirical rule of thumb
suggests using 80% and 20% for the training and test sets,
respectively (Kearns 1996, [52]). But it certainly depends
on the initial amount of available KB. For example, 60% vs
40% and 70% vs 30% could also be used in principle in the
case of large datasets (over ten thousand patterns). The
percentage also depends on the quality of the available KB.
When both photometry and spectroscopy are particularly
clean and precise, with a high S/N, it may even be possible
to obtain high performance by training on just half of the
KB.

On the other hand, the more patterns are available for the
test, the more consistent the statistical evaluation of the
experiment performance will be.
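
The Row Shuffle and Split operations described above amount
to a random permutation followed by a cut at the chosen
percentage. The following Python sketch shows the idea; the
function name shuffle_and_split, the fixed seed and the
rounding of the training size are assumptions made for this
example, not details taken from the application.

```python
import random


def shuffle_and_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into disjoint training and test sets."""
    rng = random.Random(seed)          # fixed seed only to make the example repeatable
    shuffled = list(rows)              # leave the original table untouched
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


# Example: the 80% / 20% rule of thumb on a table of 100 patterns.
train, test = shuffle_and_split(list(range(100)), train_fraction=0.8)
```

Changing train_fraction to 0.7 or 0.6 reproduces the
alternative splits mentioned for large or particularly clean
knowledge bases.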

Data Plotting 

Within the PhotoRApToR application there are also
instruments, based on the STILTS toolset (Taylor 2006),
capable of generating different types of plots (some
examples are shown in Fig. 8 and the related figures).
These options are particularly useful during the
preparation phase of the data for the experiments.

The graphical options selectable by the user are:

— multi-column histograms; 

— multiple 2D and 3D scatter plots. 

Data Feature Selection 


Learning by examples stands for a training scheme operating
under the supervision of an oracle capable of providing the
correct, already known, outcome for each of the training
samples. This outcome is a class or a value of the
examples, and its representation depends on the available
KB and on its intrinsic nature, even though in most cases
it is based on a series of numerical attributes, related to
the extracted KB, organized and submitted in a homogeneous
way.

Therefore, a fundamental step for any machine learning
experiment is to decide which features to use as input
attributes for the patterns to be learned. In the specific
case of photo-z estimation, for a given data set, it is
necessary to inspect and check which types of fluxes
(bands) and combinations (magnitudes, colors) are more
effective.

Fig. 5 The main panel showing details about the loaded data table and the editing options.

In practice, the user must maximize the information carried
by hidden correlations among the different bands,
magnitudes and zspec available. Contrary to what might be
expected, using the maximum number of available parameters
is not always the best way to train a machine learning
model. Experience demonstrates, in fact, that the quality
of the data, more than the quantity of features and
patterns, is the key to obtaining the best prediction
results (Brescia et al. 2013, [8]). This phase is very time
consuming and usually requires many tens or even hundreds
of experiments. Of course, the exact number of experiments
depends on a variety of factors, among which are: the
number of photometric bands and magnitudes for which high
quality zspec entries are available in the KB; the
photometric and spectroscopic quality of the data; the type
of magnitudes (i.e. aperture, total or isophotal
magnitudes, etc.); the completeness of the spectroscopic
coverage within the KB; and the spectroscopic range. In the
authors' experience, quite often the optimal combination
turned out to be the feature set obtained from the colors
plus one reference magnitude for each region of the
electromagnetic spectrum (broadly divided into UV, optical,
Near Infrared, Far Infrared, etc.) [8]. This can be
understood by remembering that colors convey more
information than the single related magnitudes: from the
basic equation defining magnitudes it is easy to see that a
magnitude difference corresponds to a flux ratio, and hence
in the derived colors an ordering relationship among
features is always implicitly assumed.
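
The relation between colors and flux ratios, and the
"colors plus one reference magnitude" feature set, can be
illustrated with a short Python sketch. The function names,
the band labels and the choice of adjacent-band colors are
assumptions made here; the sketch covers a single spectral
region and assumes that the photometric zero points cancel
in the difference.

```python
import math


def color_from_fluxes(flux_a, flux_b):
    """Color m_a - m_b = -2.5 log10(F_a / F_b), for equal zero points."""
    return -2.5 * math.log10(flux_a / flux_b)


def build_features(magnitudes, bands, reference_band):
    """Adjacent-band colors followed by one reference magnitude."""
    colors = [magnitudes[a] - magnitudes[b] for a, b in zip(bands, bands[1:])]
    return colors + [magnitudes[reference_band]]


# Hypothetical optical magnitudes of one object.
mags = {"u": 20.0, "g": 19.0, "r": 18.5}
features = build_features(mags, ["u", "g", "r"], reference_band="r")
```

With several spectral regions, the same construction would
be repeated per region and the resulting lists concatenated
before adding the target column.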


3.2 Performing experiments 

After having prepared the KB, the user should have two
subset tables ready to be submitted for a photo-z
experiment. As shown in Fig. 2, the experiment consists of
a pre-determined sequence of steps, namely: (i) Training
and validation of the model network; (ii) blind Test of the
trained model network; (iii) Run, i.e. the execution on new
data samples of a well trained, validated and tested
network.

We point out that for the first two steps the basic rule is
to use disjoint but homogeneous data subsets, because all
empirical photo-z methods in general may extrapolate poorly
outside the range of parameter distributions covered by the
training. In other words, outside the limits of magnitudes
and spectroscopic redshift (zspec) imposed by the training
set, these methods do not ensure optimal performance.
Therefore, in order to remain in a safe condition, the user
must select the test data according to the training sample
limits.
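
One simple way to apply this rule is to keep only the test
patterns whose values fall inside the per-column ranges
spanned by the training set. The Python sketch below assumes
that each pattern is a list of numbers (features followed by
zspec); the function names are hypothetical and a box cut in
each column is only one possible selection criterion.

```python
def training_limits(train_rows):
    """Per-column (min, max) over the training patterns."""
    return [(min(column), max(column)) for column in zip(*train_rows)]


def select_within_limits(rows, limits):
    """Keep the rows whose every value lies inside the training limits."""
    return [row for row in rows
            if all(low <= value <= high for value, (low, high) in zip(row, limits))]
```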

None of the objects included in the training sample should
be included in the test sample and, moreover, only the data
set used for the test has to be used to generate
performance statistics. In other words the test must be
blind, i.e. containing only objects never submitted to the
network before.

As far as training is concerned, this phase embeds two
processing steps: the training of the MLPQNA model network
and its validation. It is in fact quite frequent for
machine learning models to suffer from overfitting on
training data, which degrades the training performance. The
problem arises from the paradigm of supervised machine
learning itself. Any ML model is trained on a set of
training data in order to become able to predict new data
points. Therefore its goal is not just to maximize its
accuracy on training data, but mainly its predictive
accuracy on new data instances. Indeed, the more
computationally stiff the model is during training, the
higher the risk of fitting the noise and other
peculiarities of the training sample when it is applied to
new data [21]. The technique implemented within
PhotoRApToR, i.e. the so called leave-one-out cross
validation, does not suffer from this drawback; it can
avoid overfitting on data and is able to improve the
generalization performance of the ML model. In this way,
validation can be implicitly performed during training, by
enabling at setup the standard leave-one-out k-fold cross
validation mechanism [25]. The automated cross-validation
process consists in performing k different training runs
with the following procedure: (i) the training set is split
into k random subsets, each one composed of the same
percentage of the data set (depending on the choice of k);
(ii) at each run one subset is excluded and used for
validation, while the remaining part of the data set is
used for training. While avoiding overfitting, the k-fold
cross validation increases the execution time, by an amount
estimable as about k - 1 times the total number of runs.
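
The two-step procedure can be sketched as a generator of
index sets, one pair per training run. This is a generic
k-fold illustration in Python, not the code embedded in
PhotoRApToR: the function name k_fold_splits and the seed
parameter are hypothetical, and the actual training of the
network on each split is left out.

```python
import random


def k_fold_splits(n_patterns, k, seed=0):
    """Yield (training_indices, validation_indices) for each of the k runs."""
    indices = list(range(n_patterns))
    random.Random(seed).shuffle(indices)           # (i) k random subsets
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]                      # (ii) excluded subset
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Example: 5 runs over 10 patterns, each validating on a different subset.
splits = list(k_fold_splits(10, 5))
```

Each pattern is used for validation exactly once, so the
model is trained k times instead of once, which is the
source of the longer execution time mentioned above.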

Concerning the photo-z experiment setup, special care must
be paid to the setup of the training parameters, because
all the other use cases, namely Test and Run (i.e. the
execution on new data), require only the specification of
the proper input data set and the recall of the internal
model configuration as it was frozen at the end of training
(Fig. 9). We can group the MLPQNA model training parameters
into three subsets: network topology, learning rule setup
and validation setup.

— Network topology. It includes all parameters related to the MLP network architecture;

    — Number of input neurons. In terms of the input data set, it corresponds to the number of columns of the data table (also known as the input features of the data sample, i.e. the number of fluxes, magnitudes or colors composing the photometric information of each object in the data), except for the target column (i.e. the spectroscopic redshift), which is related to the single output neuron of the regression network. More generally, in the case of classification problems, the number of output neurons depends on the number of desired classes;

    — Number of neurons in the first hidden layer. As a rule of thumb, it is common practice to set this number to 2N + 1, where N is the number of input neurons, but it can be arbitrarily chosen by the user;

    — Number of neurons in the second hidden layer. This is an optional parameter. Although not required in normal conditions, as stated by the well-known universal approximation theorem, some problems dealing with a parameter space of very high complexity, i.e. with a large amount of distribution irregularities, are better treated by so-called deep networks, i.e. networks with more than one computational (hidden) layer [1]. As a rule of thumb, it is reasonable to set this number to N - 1, where N is the number of input neurons. In any case, it is strongly suggested to use a number strictly lower than the dimension of the first hidden layer;

    — Number of neurons in the output layer. This number is forced to 1 for regression problems, while in the case of classification it depends on the number of classes present in the treated problem;

    — Trained network weights. This parameter is related to the matrix of weights (internal connections among neurons). A weight matrix exists only after at least one training session has been performed. Therefore, this parameter is left empty at the beginning of any experiment, whereas for all other use cases (Test or Run) it is required to load a previously trained network. This parameter can also be used to perform further training cycles on an already trained network (i.e. in the case of an incremental learning experiment).

— Validation setup: all parameters related to the optional training validation process;

    — Cross validation k value. When cross validation is enabled, this value drives the automatic procedure that splits the training data set into k subsets and applies a k-step cycle in which the training error is evaluated and the performance is validated. Reasonable values are between 5 and 10, depending on the amount of training data used. The k-fold cross validation intrinsically tries to avoid overfitting. Nonetheless, in rare cases (such as a wrong choice of the k parameter with respect to the training set size), a residual overfitting may occur. To verify it, the user should simply inspect the results, usually by comparing training and test performance: a training accuracy much better than the test accuracy is a typical clue of overfitting. When cross validation is enabled with a proper choice of k, such events should by definition be avoided. The choice of k is not deterministic, but regulated by a rule of thumb depending on the number of training patterns. We also recall that this value strongly affects the overall computing time of the experiment.

— Learning rule setup. It includes all parameters related to the QNA learning rule;

    — Maximum number of iterations at each Hessian approximation cycle. The typical range for this value is [1000, 10000], depending on the best compromise between the requested precision and the complexity of the problem. It can affect the computing time of the training;

    — Number of Hessian approximation cycles. Namely, the number of approximation cycles searching for the best value close to the Hessian of the error. If set to zero, the maximum number of iterations will be used for a single cycle. At each cycle the algorithm performs a series of iterations along the direction of the minimum error gradient, trying to approximate the Hessian value. A reasonable range is [20, 60], although also in this case the exact value depends on the final precision required. If set to a high value, it is recommended to enable the cross validation option (see Validation setup), to prevent overfitting;

    — Training error threshold. This is one of the stopping criteria of the algorithm (alternative to the pair of parameters iterations and cycles). A value of 0.001 is typical for photo-z experiments;

    — Learning decay. This value determines the analytical stiffness of the approximation process. It affects the expression of the weight updating law, by adding the term $decay * \|networkweights\|^2$. Its range may vary from a minimum value of 0.0001 (very low stiffness) up to 1000.0 (very high stiffness). Also in this case, if a very low value is adopted, it is recommended to enable the cross validation option (see Validation setup), to prevent overfitting.
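The rules of thumb quoted in the list above for the network topology (2N + 1 neurons in the first hidden layer, N - 1 in the optional second one, a single output neuron for regression or one per class for classification) can be collected in a small helper. The sketch below is only illustrative: the function `suggest_topology` and its arguments are hypothetical names and do not correspond to any PhotoRApToR interface.

```python
def suggest_topology(n_features, n_classes=None, two_hidden_layers=False):
    """Return the layer sizes [input, hidden..., output] as a list of ints."""
    if n_features < 1:
        raise ValueError("at least one input feature is required")
    layers = [n_features, 2 * n_features + 1]   # first hidden layer: 2N + 1
    if two_hidden_layers:
        if n_features < 2:
            raise ValueError("N - 1 hidden neurons requires N >= 2")
        layers.append(n_features - 1)           # second hidden layer: N - 1
    # one output neuron for regression, one per class for classification
    layers.append(1 if n_classes is None else n_classes)
    return layers
```

For instance, a regression experiment with 10 photometric features and two hidden layers would give the layer sizes 10, 21, 9 and 1. Since N - 1 is always smaller than 2N + 1, the second hidden layer stays strictly smaller than the first one, as suggested in the text.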

The error calculated by the MLPQNA model during training is evaluated over all the presented input patterns in terms of the difference between the known target values and the calculated outputs of the model. The error function in the regression case is based on the Least Square Error (LSE) plus Tychonov regularization [22]. This function is defined as follows:

$$E = \frac{\sum_{i=1}^{N} (y_i - t_i)^2}{2} + \frac{decay \cdot \|W\|^2}{2}$$

where N is the number of input patterns, $y_i$ and $t_i$ are the network output and the pattern target respectively, decay is the decay input parameter and $\|W\|$ is the norm of the network weight matrix.
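As a worked illustration, the regularized error defined above can be evaluated directly from the network outputs, the targets and the weights. The function below is a minimal sketch: the name `regularized_error` is hypothetical and the weight matrix is assumed to be passed as a flat sequence of values, so that the squared norm is simply the sum of the squared weights.

```python
def regularized_error(outputs, targets, weights, decay):
    """E = sum((y_i - t_i)^2) / 2 + decay * ||W||^2 / 2."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    squared_residuals = sum((y - t) ** 2 for y, t in zip(outputs, targets))
    squared_norm = sum(w * w for w in weights)   # ||W||^2 for flattened weights
    return 0.5 * squared_residuals + 0.5 * decay * squared_norm
```

For two patterns with outputs (1, 2), targets (0, 2), weights (3, 4) and decay = 0.1, the data term is 0.5 and the regularization term is 1.25, so that E = 1.75.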

Weight decay regularization is the most important issue within the model mechanisms. When the regularization factor is accurately chosen, the generalization error of the trained neural network can be improved and the training can be accelerated. If the best regularization parameter decay is unknown, it can be found experimentally by varying its value within the allowed range, from weak to strong regularization. In order to apply the weight decay rule, we internally minimize a more complex merit function:


Here E is the training error, S is the sum of the squares of the network weights, and the coefficient decay controls the amount of smoothing applied to the network. The optimization starts from the initial point and proceeds until the optimizer reaches a successful stopping condition.

Searching for the best decay value is a typical trial-and-error procedure. It is usually performed by training the network with different values of the parameter decay, from the lowest value (no regularization) to very large values (strongest regularization). By inspecting the statistical results at each stage of the procedure, the overfitting tendency can be monitored as the decay factor changes. A zero decay usually corresponds to an overfitted network, whereas a very large decay corresponds to an underfitted one. Between these extreme values there is a range of networks which reproduce the data set with different degrees of precision and smoothness.
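The trial-and-error scan can be illustrated with a deliberately simple stand-in for the network. The sketch below assumes a one-weight linear model y = w x, for which the minimum of the regularized error has the closed form w = sum(x t) / (sum(x^2) + decay); it is not the MLPQNA optimizer, and all names (`fit_weight`, `half_squared_error`, `scan_decay`) are hypothetical.

```python
def fit_weight(xs, ts, decay):
    """Closed-form minimizer of sum((w*x - t)^2)/2 + decay*w^2/2."""
    sxt = sum(x * t for x, t in zip(xs, ts))
    sxx = sum(x * x for x in xs)
    return sxt / (sxx + decay)


def half_squared_error(weight, xs, ts):
    """Data term of the error, without the regularization term."""
    return 0.5 * sum((weight * x - t) ** 2 for x, t in zip(xs, ts))


def scan_decay(train, validation, decays):
    """Return (decay, weight, train_error, validation_error) for each decay."""
    results = []
    for decay in decays:
        weight = fit_weight(train[0], train[1], decay)
        results.append((decay, weight,
                        half_squared_error(weight, train[0], train[1]),
                        half_squared_error(weight, validation[0], validation[1])))
    return results
```

In this toy setting the weight shrinks towards zero and the training error grows as decay increases, while the validation error indicates which intermediate value offers the best compromise between precision and smoothness.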

After a training session has successfully terminated, the model produces (among several output files) the final network weight matrix (a file called by default trainedWeights.txt) and the network configuration setup (a file called by default frozen_train_net.txt), which can be used during the next experiment steps (Test and Run use cases), together with the respective input data sets.

3.3 Inspection of results 

Interpolative methods, such as MLPQNA, have the advantage that the training set is made up of real objects. In this sense, any empirical method intrinsically takes into account effects such as the filter band-pass and flux calibrations, even though the difficulty in extrapolating to regions of the input parameter space not well sampled by the training data is one of their main drawbacks.

This is why a strong requirement of empirical meth¬ 
ods is that the training set must be large enough to 
cover properly the parameter space in terms of col¬ 
ors, magnitudes, object types and redshift. If this is 
true, then the calibrations and corresponding uncer¬ 
tainties are well known and only limited extrapolations 
beyond the observed locus in color-magnitude space are 
required. Hence, under the conditions described above 
about the consistency of the training set, a realistic 
way to measure photometric uncertainties is to compare 
the photometric redshifts estimation with spectroscopic 
measures in the test samples. 

All individual experiments should be evaluated in a consistent and objective manner through a homogeneous set of statistical indicators. We remark that all statistical results reported throughout this paper refer to the blind test data sets only. In fact, it is good practice to evaluate the results on data (i.e. the test set) which have never been presented to the network during any of the training or validation phases. Clearly, mixing test and training data would introduce a systematic bias which could mask the real performance.

Fig. 9 An example of setup phase for a photo-z regression experiment.

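Within PhotoRApToR we use a specific algorithm to generate statistics. For each experiment, given a list of N blind test samples with $z_{spec}$ and $z_{phot}$, we define:

$$\Delta z = z_{spec} - z_{phot}$$

$$\Delta z_{norm} = \frac{z_{spec} - z_{phot}}{1 + z_{spec}}$$

where $\Delta z_{norm}$ is the normalized $\Delta z$. By indicating with x either $\Delta z$ or $\Delta z_{norm}$, we calculate the following statistical indicators:

$$bias(x) = \frac{\sum_{i=1}^{N} x_i}{N}$$

$$\sigma(x) = \sqrt{\frac{\sum_{i=1}^{N} \left( x_i - \frac{\sum_{i=1}^{N} x_i}{N} \right)^2}{N}}$$

$$RMS(x) = \sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N}}$$

$$MAD(x) = Median(|x|)$$

$$NMAD(x) = 1.4826 \times Median(|x|)$$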
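The indicators defined above are straightforward to compute. The following sketch uses only the Python standard library; the function names `normalized_residuals` and `indicators` and the dictionary keys are hypothetical and do not reflect the format of the statistics files written by PhotoRApToR.

```python
import math
import statistics


def normalized_residuals(z_spec, z_phot):
    """Return the list of (z_spec - z_phot) / (1 + z_spec) values."""
    return [(zs - zp) / (1.0 + zs) for zs, zp in zip(z_spec, z_phot)]


def indicators(x):
    """Compute bias, sigma, RMS, MAD and NMAD of a list of residuals x."""
    n = len(x)
    bias = sum(x) / n
    sigma = math.sqrt(sum((xi - bias) ** 2 for xi in x) / n)
    rms = math.sqrt(sum(xi * xi for xi in x) / n)
    mad = statistics.median(abs(xi) for xi in x)
    return {"bias": bias, "sigma": sigma, "rms": rms,
            "mad": mad, "nmad": 1.4826 * mad}
```

A typical call would be `indicators(normalized_residuals(z_spec, z_phot))`, with the two lists taken from the blind test set only.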


There is also a relation between the Root Mean Square (RMS) and the standard deviation $\sigma$: $RMS = \sqrt{mean^2 + \sigma^2}$. Since $\sigma^2$ is the variance, we have $RMS = \sqrt{mean^2 + variance}$. Therefore, for a direct comparison of results in terms of the distance of $m\sigma$ (m = 1, 2, ...) from the distribution of $\Delta z$, it is much more precise to use the standard deviation as the main indicator, rather than the simple RMS.

There is often confusion about the relation between photometric and spectroscopic redshifts to which the statistical indicators are applied. For instance, the performance could be very different if the simple $\Delta z$ is used instead of $\Delta z_{norm}$. The point is that $\Delta z$ is not the best statistical indicator in the specific case of photometric redshift prediction.

The velocity dispersion error, intrinsically present within the photometric estimation, is not uniform over a wide range of spectroscopic redshift, and therefore the related statistics cannot give a consistent estimate. On the contrary, the normalized term $\Delta z_{norm}$ carries more uniform information, correlating in a more correct way the variation of the photometric estimation, and thus permitting a more consistent statistical evaluation at all ranges of spectroscopic redshift.

As far as the analysis of the catastrophic outliers is concerned, according to [35], the parameter $D_{95} = \Delta_{95} / (1 + z_{phot})$ enables the identification of outliers in photometric redshifts derived through SED fitting methods (usually evaluated through numerical simulations based on mock catalogues). In fact, under the hypothesis that the redshift error $\Delta z_{norm}$ is Gaussian, the catastrophic redshift error limit would be constrained by the width of the redshift probability distribution corresponding to the 95% confidence interval, i.e. with $\Delta_{95} = 2\sigma(\Delta z_{norm})$. In our case, however, photo-z are empirical, i.e. not based on any specific fitting model, and it is preferable to use the standard deviation value $\sigma(\Delta z_{norm})$ derived from the photometric cross-matched samples, although it could overestimate the theoretical Gaussian $\sigma$, due to the residual spectroscopic uncertainty as well as to the method training error. Therefore, we consider as catastrophic outliers the objects with $|\Delta z_{norm}| > 2\sigma(\Delta z_{norm})$. Since it is also common practice to indicate as outliers all objects with $|\Delta z_{norm}| > 0.15$, this fraction is included in the provided statistics as well.
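The two outlier criteria discussed above can be compared on the same sample with a short sketch. The function name `outlier_fractions` and its keyword arguments are hypothetical; the standard deviation is computed with N in the denominator, as in the definition of $\sigma(x)$ given earlier.

```python
import math


def outlier_fractions(dz_norm, fixed_threshold=0.15, n_sigma=2.0):
    """Fractions of objects with |dz_norm| above n_sigma*sigma and above a fixed cut."""
    n = len(dz_norm)
    mean = sum(dz_norm) / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in dz_norm) / n)
    above_sigma = sum(1 for d in dz_norm if abs(d) > n_sigma * sigma)
    above_fixed = sum(1 for d in dz_norm if abs(d) > fixed_threshold)
    return {"sigma": sigma,
            "frac_sigma": above_sigma / n,
            "frac_fixed": above_fixed / n}
```

Note that the sigma-based threshold depends on the sample itself, so a few very large residuals widen it and may hide moderate outliers that the fixed 0.15 cut would still flag.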

It is also important to notice that for empirical methods it is useful to analyze the correlation between $NMAD(\Delta z_{norm}) = 1.48 \times median(|\Delta z_{norm}|)$ and the standard deviation $\sigma_{clean}(\Delta z_{norm})$ calculated on the data sample for which $|\Delta z_{norm}| < 2\sigma(\Delta z_{norm})$. In fact, when the NMAD is smaller than $\sigma_{clean}$, we can assert that the pseudo-Gaussian distribution of $\Delta z_{norm}$ is mostly influenced by the presence of catastrophic outliers.
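A minimal sketch of this comparison is given below. It assumes that the clean sample is obtained with a single 2 sigma cut on the full sample (no iteration), and the names `std` and `clean_comparison` are hypothetical.

```python
import math
import statistics


def std(values):
    """Standard deviation with N in the denominator."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def clean_comparison(dz_norm):
    """Return NMAD of the full sample and sigma of the 2-sigma clipped sample."""
    sigma = std(dz_norm)
    clean = [d for d in dz_norm if abs(d) < 2.0 * sigma]
    nmad = 1.48 * statistics.median(abs(d) for d in dz_norm)
    return {"nmad": nmad, "sigma": sigma,
            "sigma_clean": std(clean), "n_clean": len(clean)}
```

Comparing `nmad` with `sigma_clean` (and both with the unclipped `sigma`) gives a quick indication of how strongly the tails of the distribution drive the standard deviation.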

All the described statistical indicators are provided by PhotoRApToR as the output of every photo-z estimation test and are stored in specific files (by default named test_statistics.txt). For completeness we also provide a similar statistics file as the output of any training session (Fig. 10). Its only purpose is to allow a quick comparison between training and test, in order to verify the absence of overfitting. Besides the statistics files, PhotoRApToR also makes available some graphical tools, useful to perform a visual inspection of photo-z experiments: in particular a 2D scatter plot showing the trend of photo-z vs zspec (Fig. 11), as well as a set of histograms useful to graphically evaluate the distributions of the quantities $\Delta z$ and $\Delta z_{norm}$.
4 Other functionalities 

To complete the description of the resources made available by PhotoRApToR, we wish to stress that, besides photometric redshift estimation (to be understood as a specific type of regression experiment), the user can perform generic regression as well as multi-class classification experiments.

For a generic regression problem, all the functionalities described above for the photo-z case remain valid, with the only straightforward exception of the statistics produced, which are generated for the generic quantities formulated below.

$$\Delta out = target - output$$

$$\Delta out_{norm} = \frac{target - output}{1 + target}$$

Also in the case of multi-class classification, the above considerations and options remain valid, with only some differences described in what follows.


During the training setup (Fig. 12), there are two specific options, not foreseen for regression problems:

— Output neurons. The number of neurons of the output layer (which is forced to be 1 in regression experiments) in this case corresponds to the number of different classes present in the training sample. The class identifiers are required to have a binary format label. For instance, in a three-class problem, the target classes are represented in three columns labeled, respectively, as (100), (010) and (001);

— Cross entropy. This optional parameter, if enabled, replaces the standard training error evaluation (for instance the MSE between output and target values). Its meaning is discussed below.
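The binary format required for the class identifiers corresponds to what is commonly called a one-hot encoding. As an illustration, the sketch below converts a column of class names into the three target columns of the example above; the function `one_hot_targets` and the class names used in the usage note are hypothetical.

```python
def one_hot_targets(labels, classes=None):
    """Map each label to a binary row with a single 1 in its class column."""
    classes = sorted(set(labels)) if classes is None else list(classes)
    position = {name: i for i, name in enumerate(classes)}
    return [[1 if position[label] == i else 0 for i in range(len(classes))]
            for label in labels]
```

For instance, with the class order ("star", "galaxy", "quasar"), a star is encoded as (1, 0, 0), a galaxy as (0, 1, 0) and a quasar as (0, 0, 1).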
Fig. 10 The statistics produced at the end of a photo-z regression experiment. The training and test results are also automatically stored in the files train_statistics.txt and test_statistics.txt, respectively.


The Cross Entropy (CE) error function was introduced to address classification problem evaluation in a consistent statistical fashion [43]. The CE method consists of two phases: (i) it generates a random data sample (trajectories, vectors, etc.) according to a specified mechanism; (ii) it updates the parameters of the random mechanism based on the data, in order to produce a better sample in the next iteration.

In practice, a data model is created based on the training set, and its CE is measured on a test set to assess how accurately the model predicts the test data. The method indeed compares two probability distributions: p, the true distribution of data in the data set, and q, the distribution of data as predicted by the model. Since the true distribution is unknown, the CE cannot be directly calculated; an estimate of the CE is obtained using the following expression:

$$H(T, q) = -\sum_{i=1}^{N} \frac{1}{N} \log_2 q(x_i)$$

where T is the chosen training set, corresponding to the true distribution p, N is the number of objects in the test set, and q(x) is the probability of the event x as estimated from the training set.
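The estimate above reduces to an average of base-2 logarithms. In the sketch below, `probabilities` is assumed to hold $q(x_i)$ for each of the N test objects, i.e. the probability that the model assigns to the event actually observed; this interpretation, like the function name `cross_entropy_estimate`, is an assumption made for the example.

```python
import math


def cross_entropy_estimate(probabilities):
    """H(T, q) = -(1/N) * sum(log2 q(x_i)) over the N test objects."""
    n = len(probabilities)
    if n == 0:
        raise ValueError("the test set is empty")
    if any(q <= 0.0 or q > 1.0 for q in probabilities):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log2(q) for q in probabilities) / n
```

A model that assigns probability 1 to every observed event gives H = 0, while lower probabilities give larger values: for two test objects with q = 0.5 and q = 0.25 the estimate is (1 + 2) / 2 = 1.5 bits.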

Another difference with respect to regression experiments is of course the statistics produced to evaluate the results of a classification experiment. In this case, the statistical indicators adopted are based on the commonly known confusion matrix, which can be used to easily visualize the classification performance [40]: each column of the matrix represents the instances in a predicted class, while each row represents the instances in the real class (Fig. 13). One benefit of a confusion matrix is that it makes it easy to see whether the system is mixing different classes or not.

Fig. 11 The photo-z vs zspec plot as produced after a photo-z regression experiment. In this example the diagram shows both training (black dots) and test (gray crosses) objects, although the blind test objects are the most relevant to evaluate the prediction performance.

More specifically, for a generic two-class confusion matrix,

                        OUTPUT
                        Class A    Class B
    TARGET   Class A    N_AA       N_AB
             Class B    N_BA       N_BB

we then use its entries to define the following statistical quantities:

— total efficiency: te. Defined as the ratio between the number of correctly classified objects and the total number of objects in the data set. In our confusion matrix example it would be:

$$te = \frac{N_{AA} + N_{BB}}{N_{AA} + N_{AB} + N_{BA} + N_{BB}}$$

— purity of a class: pcN. Defined as the ratio between the number of correctly classified objects of a class and the number of objects classified in that class. In our confusion matrix example it would be:

$$pcA = \frac{N_{AA}}{N_{AA} + N_{BA}}$$

$$pcB = \frac{N_{BB}}{N_{AB} + N_{BB}}$$

— completeness of a class: cmpN. Defined as the ratio between the number of correctly classified objects in that class and the total number of objects of that class in the data set. In our confusion matrix example it would be:

$$cmpA = \frac{N_{AA}}{N_{AA} + N_{AB}}$$

$$cmpB = \frac{N_{BB}}{N_{BA} + N_{BB}}$$

— contamination of a class: cntN. It is the dual of the purity, namely the ratio between the misclassified objects in a class and the number of objects classified in that class. In our confusion matrix example it would be:

$$cntA = 1 - pcA = \frac{N_{BA}}{N_{AA} + N_{BA}}$$

$$cntB = 1 - pcB = \frac{N_{AB}}{N_{AB} + N_{BB}}$$

All these statistical indicators are packed in an out¬ 
put file, produced at the end of the test phase of any 
classification experiment. 
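For a two-class problem, the indicators defined above follow directly from the four entries of the confusion matrix (rows are the target classes, columns the output classes). The sketch below is illustrative only: the function name `two_class_statistics` and the dictionary keys are hypothetical and do not describe the layout of the output file.

```python
def two_class_statistics(n_aa, n_ab, n_ba, n_bb):
    """Total efficiency, purity, completeness and contamination for classes A and B."""
    total = n_aa + n_ab + n_ba + n_bb
    pc_a = n_aa / (n_aa + n_ba)      # objects classified as A that really are A
    pc_b = n_bb / (n_ab + n_bb)      # objects classified as B that really are B
    return {
        "te": (n_aa + n_bb) / total,
        "pcA": pc_a,
        "pcB": pc_b,
        "cmpA": n_aa / (n_aa + n_ab),   # true A objects recovered as A
        "cmpB": n_bb / (n_ba + n_bb),   # true B objects recovered as B
        "cntA": 1.0 - pc_a,
        "cntB": 1.0 - pc_b,
    }
```

For example, with N_AA = 30, N_AB = 10, N_BA = 10 and N_BB = 50, the total efficiency is 0.8, while purity and completeness of class A are both 0.75 and its contamination is 0.25.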

The MLPQNA machine learning method, embedded into PhotoRApToR, has already been tested in several classification cases. In Brescia et al. 2012, [7], we compared the performance of MLPQNA with other machine learning based classifiers, as well as with traditional techniques, in terms of accuracy in identifying candidate globular clusters in the NGC 1399 HST single-band data. In Cavuoti et al. 2014, [15], we compared MLPQNA with standard MLP and Support Vector Machine to photometrically classify AGNs in the SDSS DR4 archive. Finally, we have recently exploited MLPQNA to perform classification experiments within the SDSS DR10 archive, aimed at photometrically identifying quasars from the whole sample including also galaxies and stars, as well as at verifying the possibility to disentangle normal galaxies from objects with a peculiar spectrum (Brescia et al. 2015, [11]).

Fig. 12 The setup panel of a multi-class classification experiment. It is also possible to assign arbitrary class labels to all output instances in the training and test sets (see subpanel Assigning Classes).

5 Comparison with public machine learning tools

We performed a simple comparison between PhotoRApToR and a publicly available alternative machine learning tool: the scikit-learn toolset [38]. The comparison is based on photo-z estimation by means of a supervised non-linear regression experiment, directly comparing the statistical performance of the MLPQNA model provided through PhotoRApToR with that of the widely known ensemble method based on Random Forest [5], which uses a random subset of candidate data features to build an ensemble of decision trees.

The data set used for the experiment was obtained by merging the photometry from four different surveys (UKIDSS, SDSS, GALEX and WISE), including derived colors and reference magnitudes for each band as internal features, thus covering a wide range of wavelengths from the UV to the mid-infrared. The spectroscopic redshifts (i.e. the zspec target values) were derived from selected quasars of the SDSS-DR7 database. The complete KB consisted of $\sim 1.4 \times 10^4$ objects, of which 60% were used as training set and the remaining 40% as blind test set (see Brescia et al. 2013, [3], for more details). We remark also that in that case our MLPQNA was directly compared with several other photo-z estimation methods (see references therein), achieving the best results.

Fig. 13 The statistics produced at the end of a 2-class classification experiment.


After having trained the two ML models with the same training set, their photo-z estimation results have been compared in terms of statistics and residual analysis (outlier percentages). The results are shown in Fig. 14 and reported in Tab. 1. From the comparison, it is apparent that MLPQNA performs better than Random Forest, especially in the high-redshift zone (i.e. at zspec > 2.0), showing a more robust prediction capability also in the sparsely populated regions of the parameter space.


In addition, unlike with PhotoRApToR, setting up and running the Random Forest model provided by the scikit-learn package, as well as preparing and executing the experiments, required some manipulation of the source code. The reason is that the scikit-learn package is provided as a library to be imported in a user-defined script, which implies a certain knowledge of the Python programming language.

Although we reported a use case example where PhotoRApToR has been tested on a data set of about $10^4$ samples, we want to emphasize that the reliability of our resource has already been verified for data sets of up to $\sim 10^6$ samples. However, in such cases the computational cost of the experiment becomes very high, although the regression accuracy does not seem to require such an amount of data in the training set. Therefore, as a general rule of thumb, a good compromise between computational time and performance could be to limit the training sample to about $10^5$ samples.

Fig. 14 The photo-z vs zspec scatter plots as produced after the photo-z estimation experiment. The upper plot refers to the Random Forest model while the lower one is related to the MLPQNA model results. Both diagrams show the distributions of the $\sim 5.7 \times 10^3$ objects composing the blind test set.

Table 1 Comparison of the performance of the different tools. MLPQNA is the ML engine of our application, based on a four-layer neural network, while Random Forest is the ML model provided by the scikit-learn public resource. Both methods were trained on the multi-survey mixed (colors + reference magnitudes) data set, obtained by cross-matching the photometry of the UKIDSS, SDSS, GALEX and WISE surveys. The reported statistics refer to the photo-z estimation on the blind test set of about $5.7 \times 10^3$ QSO objects. For the definition of the parameters and for discussion see text.

Photo-z Estimation Statistics ($\Delta z_{norm}$)

    Model                     BIAS     σ        MAD      RMS      NMAD
    PhotoRApToR (MLPQNA)      0.004    0.069    0.020    0.069    0.029
    Scikit (Random Forest)    0.009    0.083    0.021    0.084    0.031

Outlier percentages [%] ($|\Delta z_{norm}|$)

    Model                     > 0.15   > 1σ     > 2σ     > 3σ     > 4σ
    PhotoRApToR (MLPQNA)      2.43     9.39     2.89     1.40     0.91
    Scikit (Random Forest)    5.27     11.03    4.48     2.31     1.34

In addition, our MLPQNA model has been tested in a public photo-z contest (PHAT1, Hildebrandt et al. 2010, [27], and Cavuoti et al. 2012, [14]), resulting as one of the best interpolative methods. In another work (Brescia et al. 2014, [10]) we published a catalogue of photometric redshifts for the SDSS DR9 release, comparing our prediction accuracy with that of other machine learning methods. More recently PhotoRApToR has been used by an independent group (Hoyle et al. 2014, [25]), who performed a regression feature analysis with SDSS DR10 galaxies, comparing our resource with random forest (AdaBoost, [22]) and FANN artificial neural networks (Nissen 2003, [55]).


6 Perspective and conclusions 

Driven by the advances in digital detectors and computing technology, astronomy has become an immensely data-rich science. This exponential data avalanche continues. It enables some exciting new science, but poses many non-trivial challenges that are common to many other data-driven fields. Nowadays the technological evolution of astronomical instruments has been so fast as to make it physically impossible to move the data from their original repositories. The real goal of science, namely data analysis and knowledge discovery, begins after all the data processing and data delivery through the archives. This requires some powerful new approaches to data exploration and analysis, leading to knowledge discovery and understanding. This implies that, as has always been asked for but never implemented, we must be able to move the programs and not the data. Therefore, the future of any data-driven service depends on the capability and possibility of moving the data mining applications to the data centers hosting the data themselves. In such a scenario, PhotoRApToR represents our test bench of a desktop application prototype capable of fulfilling this concept. As a final perspective, we want to address the still open problem of finding an efficient, reliable and standard way to provide single photo-z errors in the case of interpolative methods. We have recently started to investigate this problem and intend to equip PhotoRApToR with such a tool in the near future.